
Conversation

@JJJYmmm (Contributor) commented Oct 26, 2025

This PR adds support for the Qwen3-VL series, including both the dense and MoE variants.
The original implementation was contributed by @yairpatch and @Thireus (see #16207). @LETS-BEE also helped address issues such as weight loading.

In this PR, I’ve fixed several algorithmic implementation details (e.g., deepstack), added support for MRoPE-Interleave, and performed final code cleanup.
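
For readers unfamiliar with MRoPE-Interleave, here is a toy sketch of the idea (illustrative only, not the ggml/llama.cpp kernel; the section sizes and the exact cyclic order are assumptions): standard MRoPE assigns the temporal/height/width position components to contiguous blocks of rotary pairs, while the interleaved variant cycles through the components across the pairs.

// Toy illustration only -- not the actual rope kernel. Section sizes and the
// cyclic order below are assumptions made for the example.
#include <cstdio>

enum component { T = 0, H = 1, W = 2 }; // temporal, height, width

// standard MRoPE: contiguous sections [0, s_t) -> T, [s_t, s_t + s_h) -> H, rest -> W
static component chunked(int pair, int s_t, int s_h) {
    if (pair < s_t)       return T;
    if (pair < s_t + s_h) return H;
    return W;
}

// interleaved MRoPE: cycle T, H, W, T, H, W, ... across the rotary pairs
static component interleaved(int pair) {
    return (component) (pair % 3);
}

int main() {
    const int n_pairs = 12, s_t = 4, s_h = 4; // assumed section sizes
    for (int i = 0; i < n_pairs; ++i) {
        std::printf("pair %2d  chunked=%d  interleaved=%d\n", i, chunked(i, s_t, s_h), interleaved(i));
    }
    return 0;
}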

JJJYmmm and others added 2 commits October 26, 2025 19:18
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 26, 2025
@taronaeo linked an issue on Oct 26, 2025 that may be closed by this pull request

@Thireus (Contributor) commented Oct 26, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases - tagged tr-qwen3-vl-6

@ddh0 (Contributor) commented Oct 27, 2025

Thank you! Looking forward to this so we (myself and @rujialiu) can progress with #16600 :)

@xbl916 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

iosub added a commit to iosub/ollama that referenced this pull request Oct 27, 2025
Integrates Qwen3-VL and Qwen3VL-MoE architecture support from upstream.
Implements IMROPE (Interleaved Multi-resolution RoPE) for vision models.
Adds deepstack layer support for visual feature processing.

Changes include:
- New architecture types: LLM_ARCH_QWEN3VL, LLM_ARCH_QWEN3VLMOE
- IMROPE rope type for vision position encoding
- Deepstack visual feature handling in clip.cpp
- GGML CUDA kernels for IMROPE
- Tensor mappings for Qwen3VL architecture

Upstream PR: ggml-org/llama.cpp#16780
Contributors: @JJJYmmm @yairpatch @Thireus @LETS-BEE

@theo77186

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

@psi00 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

I'm still getting an unknown model architecture error here?

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Apps\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Apps\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Apps\llama.cpp\ggml-cpu-haswell.dll
build: 7106 (495c6115) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:07:00.0) - 11240 MiB free
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: loaded meta data with 21 key-value pairs and 399 tensors from C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = Qwen3-VL-8B-Instruct
llama_model_loader: - kv   1:                                    version u32              = 3
llama_model_loader: - kv   2:                               tensor_count u32              = 399
llama_model_loader: - kv   3:                               general.type str              = model
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                               bos_token_id u32              = 151643
llama_model_loader: - kv   6:                               eos_token_id u32              = 151645
llama_model_loader: - kv   7:                                 hidden_act str              = silu
llama_model_loader: - kv   8:                                hidden_size u32              = 4096
llama_model_loader: - kv   9:                          intermediate_size u32              = 12288
llama_model_loader: - kv  10:                    max_position_embeddings u32              = 262144
llama_model_loader: - kv  11:                        num_attention_heads u32              = 32
llama_model_loader: - kv  12:                          num_hidden_layers u32              = 36
llama_model_loader: - kv  13:                        num_key_value_heads u32              = 8
llama_model_loader: - kv  14:                               rms_norm_eps f32              = 0.000001
llama_model_loader: - kv  15:                                 rope_theta f32              = 5000000.000000
llama_model_loader: - kv  16:                             attention_bias bool             = false
llama_model_loader: - kv  17:                                   head_dim u32              = 128
llama_model_loader: - kv  18:                        tie_word_embeddings bool             = false
llama_model_loader: - kv  19:                                 vocab_size u32              = 151936
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_0:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0 (guessed)
print_info: file size   = 4.29 GiB (4.50 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'Qwen3-VL-8B-Instruct'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model

@i4TsU commented Oct 27, 2025

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

They are not, as @FMayran and @rujialiu are still figuring out the best way to implement a proper fix once and for all :). You can cherry-pick the changes from #16745 without any problems and build it yourself as a temporary solution, though make sure to check the issues raised in the last 24-48 hours about why it's not yet a 100% fix.

@PaymonHossaini commented Oct 27, 2025

@psi00

I have managed to get Qwen3-VL-30B-A3B-Instruct running on Ubuntu just now (specifically with a Ryzen AI Max+ 395 and Vulkan). Did you create your own GGUF/mmproj.gguf using convert_hf_to_gguf.py?

Here is how I prepared mine:

huggingface-cli download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir tmp/Qwen3-VL-30B-A3B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

# convert the model
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models

# convert the mmproj
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

# run llama.cpp
build-vulkan/bin/llama-server -m models/Qwen3-VL-30B-A3B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

No GGUFs I found off the shelf were working right until I did this. Hope this helps.

@psi00 commented Oct 27, 2025

Thank you. I was using the GGUFs from NexaAI. May I add, though, that I think the architecture is different for each model (30B/8B/4B)? I will try this, thanks again.

Comment on the deepstack concatenation code:

    deepstack_features = feat;
} else {
    // concat along the feature dimension
    deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);

Collaborator:

Not very important to optimize this right now, but doing ggml_concat across multiple layers can increase memory usage. One trick is to allocate one big tensor, then use ggml_set_rows to copy the intermediate results into the allocated tensor.

cc @ggerganov, do you think this would be a good approach for concatenating multiple tensors?
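
To make the suggestion concrete, a minimal sketch of the pre-allocation idea (names and shapes are assumptions; it uses ggml_view_2d + ggml_cpy to place each layer's slice, while the ggml_set_rows variant mentioned above would do a row-wise copy instead):

#include "ggml.h"

// Sketch only: allocate the concatenated deepstack buffer once and copy each
// layer's features into a view of it, instead of growing it with ggml_concat.
// Assumes every feats[l] is F32 with shape [n_embd, n_tokens].
static ggml_tensor * build_deepstack_buffer(ggml_context * ctx0,
                                            ggml_cgraph  * gf,
                                            ggml_tensor ** feats,
                                            int            n_deepstack,
                                            int64_t        n_embd,
                                            int64_t        n_tokens) {
    // one big destination: [n_deepstack * n_embd, n_tokens]
    ggml_tensor * dst = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_deepstack * n_embd, n_tokens);

    for (int l = 0; l < n_deepstack; ++l) {
        // view of the slice this layer should fill (offset along dim 0)
        ggml_tensor * slot = ggml_view_2d(ctx0, dst, n_embd, n_tokens,
                dst->nb[1],                                    // row stride of the big tensor
                (size_t) l * n_embd * ggml_element_size(dst)); // byte offset within each row
        // copy the layer's features into the pre-allocated slot; expanding the
        // copy node into the graph keeps it from being pruned
        ggml_build_forward_expand(gf, ggml_cpy(ctx0, feats[l], slot));
    }
    return dst;
}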

Contributor Author:

Oh, I just followed the style of llava:

llama.cpp/tools/mtmd/clip.cpp

Lines 1278 to 1285 in 1c1409e

    // If feature layers are explicitly set, stack them (if we have multiple)
    if (!embedding_stack.empty()) {
        embeddings = embedding_stack[0];
        for (size_t i = 1; i < embedding_stack.size(); i++) {
            embeddings = ggml_concat(ctx0, embeddings, embedding_stack[i], 0);
        }
    }
}

Collaborator:

Yes, but llava has a fixed number of tokens (no dynamic resolution), so the memory usage is predictable.

Contributor Author:

Got it, I’ll optimize it later. 🫡

Contributor Author:

done!

Member:

Add a TODO comment with a reference to this thread to not forget to improve this later.

Collaborator (@ngxson, Oct 30, 2025):

Suggested change:

-            deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);
+            // TODO: pre-allocate memory and use ggml_set_rows, see: https://github.com/ggml-org/llama.cpp/pull/16780/files#r2465886647
+            deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);

@psi00 commented Oct 27, 2025

@PaymonHossaini,
I get another architecture error when trying to quantize:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported

@PaymonHossaini commented Oct 27, 2025

@PaymonHossaini, I get another architecture error when trying to quantize:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported
[image]

While it's true that the 30B is MoE and the 8B is dense, I was unable to reproduce this issue. Make sure your local checkout tracks the PR branch, as there were some changes to that script to make it compatible with these models.

My instructions for the 8B model are below:

huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir tmp/Qwen3-VL-8B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

build-vulkan/bin/llama-server -m models/Qwen3-VL-8B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-8b-Instruct-F16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

I don't believe this issue is a result of the code changes.

@LETS-BEE (Contributor)

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

I think merging PR #16745 will likely restore the model's original performance.
However, I don't know why, but using nearest-neighbor interpolation instead of bilinear interpolation when resizing the position embeddings seems to yield better performance.
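
For context, a toy sketch of the two resampling modes being compared (not the clip.cpp implementation; the grid layout and names are assumptions). Resizing the learned position-embedding grid to the image's patch grid samples each channel either at the nearest source cell or as a bilinear blend of the four neighbours:

// Toy sketch, not the clip.cpp code: resample one channel of a [gh x gw]
// position-embedding grid at fractional source coordinates (x, y).
#include <algorithm>
#include <cmath>
#include <vector>

static float sample_nearest(const std::vector<float> & grid, int gw, int gh, float x, float y) {
    // snap to the closest source cell
    const int sx = std::clamp((int) std::lround(x), 0, gw - 1);
    const int sy = std::clamp((int) std::lround(y), 0, gh - 1);
    return grid[sy * gw + sx];
}

static float sample_bilinear(const std::vector<float> & grid, int gw, int gh, float x, float y) {
    // blend the four surrounding source cells by their distance to (x, y)
    const int   x0 = std::clamp((int) std::floor(x), 0, gw - 1);
    const int   y0 = std::clamp((int) std::floor(y), 0, gh - 1);
    const int   x1 = std::min(x0 + 1, gw - 1);
    const int   y1 = std::min(y0 + 1, gh - 1);
    const float fx = x - (float) x0;
    const float fy = y - (float) y0;
    const float top = grid[y0 * gw + x0] * (1.0f - fx) + grid[y0 * gw + x1] * fx;
    const float bot = grid[y1 * gw + x0] * (1.0f - fx) + grid[y1 * gw + x1] * fx;
    return top * (1.0f - fy) + bot * fy;
}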

@JJJYmmm JJJYmmm requested a review from 0cc4m as a code owner October 30, 2025 11:01

@JJJYmmm (Contributor Author) commented Oct 30, 2025

@CISC I've updated the corresponding file, but haven't tested it yet since I don't have a Vulkan environment at the moment.

@0cc4m (Collaborator) commented Oct 30, 2025

GLSL cannot automatically convert integers to bool, so you need the full condition; for example, if (p.is_imrope) { has to be if (p.is_imrope != 0) {

Comment on lines +4062 to +4064
self.is_deepstack_layers = [False] * int(self.hparams_vision["num_hidden_layers"] or 0)
for idx in self.hparams_vision.get("deepstack_visual_indexes", []):
    self.is_deepstack_layers[idx] = True

Collaborator (@ngxson, Oct 30, 2025):

(No action is needed, just a side note.)

The is_deepstack_layers metadata is no longer used in clip.cpp, as I want to keep the code simpler to maintain. We now use the same logic as MoE in llama.cpp: if the tensor is not present, it will be nullptr, and this triggers the code branch for deepstack layers.

But we will still keep this metadata in the GGUF for future use.
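
As an illustration of the pattern described above (a sketch only; the struct and tensor names are hypothetical, not the actual clip.cpp fields):

#include "ggml.h"

// Sketch of the nullptr-based detection: optional per-layer tensors are simply
// left as nullptr when they are absent from the GGUF, and their presence is
// what selects the deepstack code path -- no extra metadata lookup needed.
struct vision_layer_sketch {
    ggml_tensor * deepstack_fc1_w = nullptr; // hypothetical deepstack-only tensor
    // ... regular per-layer tensors ...
};

static bool is_deepstack_layer(const vision_layer_sketch & layer) {
    // same idea as the MoE detection in llama.cpp: missing tensor -> nullptr
    return layer.deepstack_fc1_w != nullptr;
}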

@JJJYmmm JJJYmmm requested a review from reeselevine as a code owner October 30, 2025 11:31
Co-authored-by: Sigbjørn Skjæret <[email protected]>

@pt13762104 (Contributor)

I see an error in the mmproj creation: ValueError: Can not map tensor 'visual.blocks.0.attn.qkv.bias'

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) label on Oct 30, 2025

@ngxson (Collaborator) commented Oct 30, 2025

As a reminder, you can also add support for other backends in follow-up PRs, to avoid pulling too many reviewers into one PR (preferably, one PR per backend).

@ngxson (Collaborator) commented Oct 30, 2025

I'm merging this in the next 30 minutes to an hour, as the CI for test-backend-ops has already passed. Thanks for the contribution, @JJJYmmm!

@JJJYmmm (Contributor Author) commented Oct 30, 2025

Thank you all for the detailed review! 🙏

@SharkWipf

I never watched a llama.cpp PR thread before, never realized how well-organized and dedicated you all are, just wanted to chime in to say: you all rock and your effort is appreciated.

@mpapili commented Oct 30, 2025

I never watched a llama.cpp PR thread before, never realized how well-organized and dedicated you all are, just wanted to chime in to say: you all rock and your effort is appreciated.

Ditto. I've been watching this PR like a hawk. Great contributors and great maintainers all around.

@ngxson ngxson merged commit d261223 into ggml-org:master Oct 30, 2025
71 of 73 checks passed

@RodriMora (Contributor)

I believe requirements.txt needs to be updated; the current transformers version does not support the Qwen3-VL architecture. This is not a problem for inference, but for quantizing it will not recognize the arch.


Labels

Apple Metal: https://en.wikipedia.org/wiki/Metal_(API)
examples
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs
python: python script changes
SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language
testing: Everything test related
Vulkan: Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: support qwen3-vl series